
perf(player): p0-1c live-playback parity test via SSIM #401

Merged
vanceingalls merged 1 commit into main from perf/p0-1c-live-playback-parity-test
Apr 23, 2026
Conversation


@vanceingalls vanceingalls commented Apr 21, 2026

Summary

Adds scenario 06: live-playback parity — the third and final tranche of the P0-1 perf-test buildout (p0-1a infra → p0-1b fps/scrub/drift → this).

The scenario plays the gsap-heavy fixture, freezes it mid-animation, screenshots the live frame, then synchronously seeks the same player back to that exact timestamp and screenshots the reference. The two PNGs are diffed with ffmpeg -lavfi ssim and the resulting average SSIM is emitted as parity_ssim_min. Baseline gate: SSIM ≥ 0.95.

This pins the player's two frame-production paths (the runtime's animation loop vs. _trySyncSeek) to each other visually, so any future drift between scrub and playback fails CI instead of silently shipping.

Motivation

<hyperframes-player> produces frames two different ways:

  1. Live playback — the runtime's animation loop advances the GSAP timeline frame-by-frame.
  2. Synchronous seek (_trySyncSeek, landed in feat(player): synchronous seek() API with same-origin detection #397) — for same-origin embeds, the player calls into the iframe runtime's seek() directly and asks for a specific time.

These paths must agree. If they don't — different rounding, different sub-frame sampling, different state ordering — scrubbing a paused composition shows different pixels than a paused-during-playback frame at the same time. That's a class of bug that only surfaces visually, never in unit tests, and only at specific timestamps where many things are mid-flight.

gsap-heavy is a 10s composition with 60 tiles each running a staggered 4s out-and-back tween. At t=5.0s a large fraction of those tiles are mid-flight, so the rendered frame has many distinct, position-sensitive pixels — the worst-case input for any sub-frame disagreement. If the two paths produce identical pixels here, they'll produce identical pixels everywhere that matters.

What changed

  • packages/player/tests/perf/scenarios/06-parity.ts — new scenario (~340 lines). Owns capture, seek, screenshot, SSIM, artifact persistence, and aggregation.
  • packages/player/tests/perf/index.ts — register parity as a scenario id, default-runs = 3, dispatch to runParity, include in the default scenario list.
  • packages/player/tests/perf/perf-gate.ts — extend PerfBaseline with paritySsimMin.
  • packages/player/tests/perf/baseline.json — seed paritySsimMin: 0.95.
  • .github/workflows/player-perf.yml — add a parity shard (3 runs) to the matrix alongside load / fps / scrub / drift.

How the scenario works

The hard part is making the two captures land on the exact same timestamp without trusting postMessage round-trips or arbitrary setTimeout settling.

  1. Install an iframe-side rAF watcher before issuing play(). The watcher polls __player.getTime() every animation frame and, the first time getTime() >= 5.0, calls __player.pause() from inside the same rAF tick. pause() is synchronous (it calls timeline.pause()), so the timeline freezes at exactly that getTime() value with no postMessage round-trip. The watcher's Promise resolves with that frozen value as the canonical T_actual for the run.
  2. Confirm isPlaying() === true via frame.waitForFunction before awaiting the watcher. Without this, the test can hang if play() hasn't kicked the timeline yet.
  3. Wait for paint — two requestAnimationFrame ticks on the host page. The first flushes pending style/layout, the second guarantees a painted compositor commit. Same paint-settlement pattern as packages/producer/src/parity-harness.ts.
  4. Screenshot the live frame — page.screenshot({ type: "png" }).
  5. Synchronously seek to T_actual — call el.seek(capturedTime) on the host page. The player's public seek() calls _trySyncSeek which (same-origin) calls __player.seek() synchronously, so no postMessage await is needed. The runtime's deterministic seek() rebuilds frame state at exactly the requested time.
  6. Wait for paint again, screenshot the reference frame.
  7. Diff with ffmpeg — ffmpeg -hide_banner -i reference.png -i actual.png -lavfi ssim -f null -. ffmpeg writes per-channel + overall SSIM to stderr; we parse the All: value, clamp at 1.0 (ffmpeg occasionally reports 1.000001 on identical inputs), and treat it as the run's score.
  8. Persist artifacts under tests/perf/results/parity/run-N/ (actual.png, reference.png, captured-time.txt) so CI can upload them and so a failed run is locally reproducible. Directory is already gitignored via the existing packages/player/tests/perf/results/ rule.
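
The freeze logic in step 1 can be sketched as a small, testable function. This is a minimal sketch, not the actual scenario code: PlayerLike and the injected raf scheduler are illustrative stand-ins for the iframe's __player and requestAnimationFrame.

```typescript
// Illustrative interface standing in for the iframe's __player global.
interface PlayerLike {
  getTime(): number;
  pause(): void;
}

// Polls once per animation frame and pauses in the same tick that getTime()
// first crosses the threshold, so the timeline freezes at exactly the value
// the Promise resolves with — no postMessage round-trip involved.
function watchAndPause(
  player: PlayerLike,
  thresholdS: number,
  raf: (cb: () => void) => void,
): Promise<number> {
  return new Promise((resolve) => {
    const tick = () => {
      const t = player.getTime();
      if (t >= thresholdS) {
        player.pause(); // synchronous: timeline.pause() under the hood
        resolve(t); // canonical T_actual for both captures
      } else {
        raf(tick);
      }
    };
    raf(tick);
  });
}
```

In the real scenario this would run via frame.evaluate with raf = requestAnimationFrame; injecting the scheduler here just makes the freeze logic runnable outside a browser.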

Aggregation

min() across runs, not mean. We want the worst observed parity to pass the gate so a single bad run can't get masked by averaging. Both per-run scores and the aggregate are logged.
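
The parse-clamp-aggregate pipeline can be sketched in a few lines. Function names here are illustrative, not the actual 06-parity.ts API; the stderr format is ffmpeg's standard ssim filter summary line.

```typescript
// Extract the overall SSIM from ffmpeg's ssim filter stderr, e.g.
// "SSIM Y:0.998 U:0.999 V:0.999 All:0.998711 (28.9)".
function parseSsimAll(stderr: string): number {
  const m = stderr.match(/All:\s*([\d.]+)/);
  if (!m) throw new Error("no SSIM All: value in ffmpeg stderr");
  // Clamp: ffmpeg occasionally reports 1.000001 on identical inputs.
  return Math.min(Number(m[1]), 1.0);
}

function aggregateParity(runScores: number[]): number {
  // min(), not mean(): the worst observed run must pass the gate on its own,
  // so a single bad run can't be masked by averaging.
  return Math.min(...runScores);
}
```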

Output metric

name direction baseline
parity_ssim_min higher-is-better paritySsimMin: 0.95

With deterministic rendering enabled in the runner, identical pixels produce SSIM very close to 1.0; the 0.95 threshold leaves headroom for legitimate fixture-level noise (font hinting, GPU compositor variance) while still catching any real disagreement between the two paths.

Test plan

  • bun run player:perf -- --scenarios=parity --runs=3 locally on gsap-heavy — passes with SSIM ≈ 0.999 across all 3 runs.
  • Inspected results/parity/run-1/actual.png and reference.png side-by-side — visually identical.
  • Inspected captured-time.txt to confirm T_actual lands just past 5.0s (within one frame).
  • Sanity test: temporarily forced a 1-frame offset between live and reference capture; SSIM dropped well below 0.95 as expected, confirming the threshold catches real drift.
  • CI: parity shard added alongside the existing load / fps / scrub / drift shards; same measure-mode / artifact-upload / aggregation flow.
  • bunx oxlint and bunx oxfmt --check clean on the new scenario.

Stack

This is the top of the perf stack:

  1. feat(core): add emitPerformanceMetric bridge for runtime telemetry #393 perf/x-1-emit-performance-metric — performance.measure() emission
  2. perf(player): share PLAYER_STYLES via adoptedStyleSheets #394 perf/p1-1-share-player-styles-via-adopted-stylesheets — adopted stylesheets
  3. perf(player): scope MutationObserver to composition hosts #395 perf/p1-2-scope-media-mutation-observer — scoped MutationObserver
  4. perf(player): coalesce _mirrorParentMediaTime writes #396 perf/p1-4-coalesce-mirror-parent-media-time — coalesce currentTime writes
  5. feat(player): synchronous seek() API with same-origin detection #397 perf/p3-1-sync-seek-same-origin — synchronous seek path (the path this PR pins)
  6. perf(player): srcdoc composition switching for studio #398 perf/p3-2-srcdoc-composition-switching — srcdoc switching
  7. perf(player): p0-1a perf test infra + composition-load smoke test #399 perf/p0-1a-perf-test-infra — server, runner, perf-gate, CI
  8. perf(player): p0-1b perf tests for fps, scrub latency, and media sync drift #400 perf/p0-1b-perf-tests-for-fps-scrub-drift — fps / scrub / drift scenarios
  9. perf(player): p0-1c live-playback parity test via SSIM #401 perf/p0-1c-live-playback-parity-test ← you are here

With this PR landed the perf harness covers all five proposal scenarios: load, fps, scrub, drift, parity.


@jrusso1020 jrusso1020 left a comment


The scenario design itself is excellent — playing the gsap-heavy fixture, freezing mid-animation via an in-iframe rAF watcher that calls __player.pause() in the same tick, screenshotting, _trySyncSeek-ing back to the frozen timestamp for a reference screenshot, and SSIM-diffing the pair with ffmpeg. That's the right shape to pin the two frame-production paths (live animation vs sync seek) to each other, and the comments explaining why the pause has to happen in the same tick as the watcher match make the intent clear.

Blocking: CI is red. The Perf: parity shard fails on every run with:

error: [scenario:parity] ffmpeg ssim failed (exit=undefined):
    at computeSsim (packages/player/tests/perf/scenarios/06-parity.ts:163:15)

Two things wrong here:

  1. The error reporting swallows the real cause. result.status is undefined/null and stderr is empty, which means the child never even started (ENOENT or similar). The real diagnostic info is on result.error, which the current code doesn't surface:

    if (result.status !== 0) {
      const stderr = (result.stderr || Buffer.from("")).toString("utf-8");
      throw new Error(`[scenario:parity] ffmpeg ssim failed (exit=${result.status}): ${stderr}`);
    }

    Please change to something like:

    if (result.error) {
      throw new Error(`[scenario:parity] ffmpeg could not be started: ${result.error.message}`);
    }
    if (result.status !== 0) { ... }

    That turns "exit=undefined" into "ffmpeg not found" (or whichever real OS error is firing) on the next CI run, which tells you which of the follow-ups below is needed.

  2. ffmpeg availability on the runner. GitHub's ubuntu-latest (currently 24.04) does include ffmpeg in the pre-installed toolset, so in theory this should Just Work — but the failure pattern looks a lot like ENOENT, and on a Bun child_process polyfill an ENOENT may land as status: undefined with empty stderr (vs Node's status: null). Belt-and-braces fix: add an explicit install step in the parity shard's steps list, or at the top of the job:

    - name: Install ffmpeg for SSIM diff
      if: matrix.shard == 'parity'
      run: sudo apt-get update && sudo apt-get install -y ffmpeg

    Even if ffmpeg is usually present, this pins the scenario against toolset drift on the hosted runner image.

Once computeSsim surfaces the real error and CI turns green, happy to re-review and approve — the scenario itself looks good.

Non-blocking:

  • SSIM baseline of 0.95 is reasonable as a starting point, but depending on how deterministic the fixture's animation is at TARGET_TIME_S and how font/subpixel rendering jitters on the runner, you may want to widen that to 0.92–0.93 for the first few enforcement cycles. It's trivial to tighten later; a false-positive below 0.95 will be a thorny debug because the signal is pixels rather than numbers.
  • Consider writing the diffed pixel map to the results/ artifact directory on failure — when this does fire in anger, a human looking at an SSIM of 0.88 wants to see where the two frames disagree, and the SSIM map is a cheap byproduct of ffmpeg's ssim filter (output it via -lavfi ssim=stats_file=...).


@vanceingalls vanceingalls force-pushed the perf/p0-1c-live-playback-parity-test branch from 4129ab2 to 9542991 Compare April 22, 2026 00:43
@vanceingalls vanceingalls force-pushed the perf/p0-1b-perf-tests-for-fps-scrub-drift branch 2 times, most recently from 111e128 to 306c164 Compare April 22, 2026 00:57
@vanceingalls vanceingalls force-pushed the perf/p0-1c-live-playback-parity-test branch 2 times, most recently from c918563 to 83d15bb Compare April 22, 2026 01:38
@vanceingalls
Collaborator Author

@jrusso1020 @miguel-heygen — both blockers resolved plus both non-blocking suggestions:

Blocker 1 — computeSsim swallowing the real cause (exit=undefined masking ENOENT): addressed in 83d15bb0. computeSsim now checks result.error before result.status — when ffmpeg never starts, the error line surfaces the actual ENOENT/EACCES/etc. with a hint to install ffmpeg, instead of the misleading (exit=undefined) you saw in CI. Old log line is now impossible to produce.

Blocker 2 — ffmpeg availability on the runner: addressed in .github/workflows/player-perf.yml (parity shard) and .github/workflows/windows-render.yml. The parity shard now installs ffmpeg explicitly:

- name: Install ffmpeg (parity shard only)
  if: matrix.shard == 'parity'
  run: |
    sudo apt-get update
    sudo apt-get install -y --no-install-recommends ffmpeg
    ffmpeg -version | head -n 1

Mirror step on the Windows job uses choco install ffmpeg -y --no-progress followed by a where.exe ffmpeg sanity check so the failure mode is "step fails loudly" rather than "scenario fails opaquely". Catalog previews use FedericoCarboni/setup-ffmpeg@v3 for the same reason. Even though ubuntu-latest (24.04) does ship ffmpeg in the toolset, pinning the install removes the runner-image dependency and matches Linux/Windows behaviour.

The non-blocking observations:

Consider writing the diffed pixel map to the results/ artifact directory on failure

Done. 06-parity.ts now has writeSsimStatsOnFailure which invokes ffmpeg with -lavfi "ssim=stats_file=…" on parse / mismatch failure and drops a parity-ssim-stats.txt per-frame breakdown into the run directory. When the shard fails, the artifact bundle now contains both reference + actual PNGs and the per-frame SSIM trace, so triage doesn't require local repro.

SSIM baseline of 0.95 is reasonable as a starting point, but ... you may want to widen that to 0.92–0.93

Adopted. baseline.json now sets paritySsimMin: 0.93. The 0.95 figure was synthetic (two captures of the same renderer under deterministic seek); 0.93 leaves headroom for legitimate sub-pixel jitter without losing the regression signal we care about (catastrophic divergence between live-playback and sync-seek paths).

Nothing else outstanding.


@miguel-heygen miguel-heygen left a comment


Re-checked the live head and validated the requested-change blocker is fixed. The parity scenario now surfaces ffmpeg startup failures correctly, installs ffmpeg in CI, and the local player perf plus browser verification passed on the reviewed head.

@vanceingalls vanceingalls force-pushed the perf/p0-1c-live-playback-parity-test branch from 83d15bb to 49b9827 Compare April 22, 2026 22:20
@vanceingalls vanceingalls force-pushed the perf/p0-1b-perf-tests-for-fps-scrub-drift branch from 2256f55 to 3c347f7 Compare April 22, 2026 22:20

@jrusso1020 jrusso1020 left a comment


Re-reviewed at head (49b9827b). Both prior blockers addressed:

  1. computeSsim error-reporting fix — result.error is now checked before result.status, with a clear "ffmpeg could not be started (ENOENT)" message plus a pointer to install ffmpeg on the runner. No more confusing exit=undefined red herring.
  2. Explicit ffmpeg install on the parity shard — apt-get install -y --no-install-recommends ffmpeg gated on matrix.shard == 'parity', with a ffmpeg -version | head -n 1 sanity print. Belt-and-braces against ubuntu-latest toolset drift.

Bonus pickups matching my non-blocking suggestions:

  • writeSsimStatsOnFailure re-invokes ffmpeg with ssim=stats_file=<runDir>/ssim-stats.log on both the non-zero-exit and parse-failure paths. The per-frame SSIM dump now rides along with the artifact upload — best possible bridge between "the gate tripped" and "which pixel region drifted" without pulling PNGs locally.
  • Baseline widened to paritySsimMin: 0.93 (vs the proposal's 0.95), with a JSDoc block explaining the 2-point cushion for sub-pixel rasterization wobble and a pointer to ratchet back to 0.95 once determinism tightens (e.g. fixed DPR + forced software raster). Exactly the "trivial to tighten later, thorny to debug a false-positive" trade I suggested.

All perf shards green on the current head (parity + load + fps + scrub + drift). Graphite mergeability still in-flight but non-blocking.


@vanceingalls vanceingalls force-pushed the perf/p0-1c-live-playback-parity-test branch from 49b9827 to a2eee64 Compare April 22, 2026 22:36
@vanceingalls vanceingalls force-pushed the perf/p0-1b-perf-tests-for-fps-scrub-drift branch 2 times, most recently from 1f13f80 to 621a276 Compare April 22, 2026 22:44
@vanceingalls vanceingalls force-pushed the perf/p0-1c-live-playback-parity-test branch from a2eee64 to 03fb792 Compare April 22, 2026 22:45
@vanceingalls vanceingalls force-pushed the perf/p0-1b-perf-tests-for-fps-scrub-drift branch from 621a276 to 84efd8a Compare April 22, 2026 23:29
@vanceingalls vanceingalls force-pushed the perf/p0-1c-live-playback-parity-test branch 2 times, most recently from 0a83a58 to edbd01f Compare April 22, 2026 23:42
@vanceingalls vanceingalls force-pushed the perf/p0-1b-perf-tests-for-fps-scrub-drift branch 2 times, most recently from 5a93e44 to 871c986 Compare April 22, 2026 23:49
@vanceingalls vanceingalls force-pushed the perf/p0-1c-live-playback-parity-test branch from edbd01f to 9014a92 Compare April 22, 2026 23:50
@vanceingalls vanceingalls force-pushed the perf/p0-1b-perf-tests-for-fps-scrub-drift branch from 871c986 to 224503d Compare April 23, 2026 00:46
@vanceingalls vanceingalls force-pushed the perf/p0-1c-live-playback-parity-test branch from 9014a92 to c05b99f Compare April 23, 2026 00:47
@vanceingalls vanceingalls force-pushed the perf/p0-1b-perf-tests-for-fps-scrub-drift branch from 224503d to f637194 Compare April 23, 2026 00:51
@vanceingalls vanceingalls force-pushed the perf/p0-1c-live-playback-parity-test branch from c05b99f to 6c6a360 Compare April 23, 2026 00:52
@vanceingalls vanceingalls force-pushed the perf/p0-1b-perf-tests-for-fps-scrub-drift branch from f637194 to 23e3dcb Compare April 23, 2026 00:59
@vanceingalls vanceingalls force-pushed the perf/p0-1c-live-playback-parity-test branch from 6c6a360 to ec31023 Compare April 23, 2026 01:00
vanceingalls added a commit that referenced this pull request Apr 23, 2026
## Summary

First slice of `P0-1` from the player perf proposal: lays the foundation for a player perf gate so later PRs can plug in fps / scrub / drift / parity scenarios without rebuilding infrastructure. Ships one smoke scenario (`03-load`, cold + warm composition load) to prove the gate end-to-end on real numbers.

## Why

There was no automated way to catch player perf regressions. Every perf concern in the existing proposal — composition load time, sustained FPS, scrub p95, mirror-clock drift, live-vs-seek parity — needs the same plumbing: a same-origin harness, a Puppeteer runner, a baseline file, a gate that emits structured results, and a CI workflow that runs the right scenarios on the right changes. Building that up-front in one reviewable PR lets every subsequent perf PR (`P0-1b`, `P0-1c`, and beyond) be a 100-line scenario file plus a baseline entry instead of re-litigating the framework.

## What changed

### Harness — `packages/player/tests/perf/server.ts`

- `Bun.serve` on a free port, single same-origin host for the player IIFE bundle, hyperframe runtime, GSAP from `node_modules`, and fixture HTML.
- Same-origin matters: cross-origin would force every probe through `postMessage`, hiding bugs and inflating numbers in ways production never sees. Tests should measure the path the studio editor actually takes.
- Routes:
  - `/player.js` → built IIFE bundle (rebuilt on demand).
  - `/vendor/runtime.js`, `/vendor/gsap.min.js` → resolved from `node_modules` so fixtures don't need to ship copies.
  - `/fixtures/*` → fixture HTML.

### Runner — `packages/player/tests/perf/runner.ts`

- `puppeteer-core` thin wrappers (`launchBrowser`, `loadHostPage`).
- Uses the system Chrome detected by `setup-chrome` in CI rather than the bundled puppeteer revision — keeps the action smaller, lets us pin Chrome version policy at the workflow level, and matches what users actually run.

### Gate — `packages/player/tests/perf/perf-gate.ts` + `baseline.json`

- Loads `baseline.json` (initial budgets: cold/warm comp load, fps, scrub p95 isolated/inline, drift max/p95) with a 10% `allowedRegressionRatio`.
- Per-metric direction (`lower-is-better` / `higher-is-better`) so the same evaluator handles latency and throughput.
- Returns a structured `GateReport` consumed by both the CLI (table output) and `metrics.json` (CI artifact).
- Two modes: `measure` (log only — used during the rollout) and `enforce` (fail the build) — flip per-metric once we trust the signal, without touching the harness.
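
A direction-aware gate check of this shape can be sketched as follows. This is a hedged sketch, not perf-gate.ts's actual API; only the baseline/direction/ratio semantics are taken from the description above.

```typescript
type Direction = "lower-is-better" | "higher-is-better";

// With a 10% allowedRegressionRatio, a higher-is-better baseline of 0.95
// gates at >= 0.855, and a lower-is-better baseline of 33 ms gates at <= 36.3 ms.
function passesGate(
  value: number,
  baseline: number,
  direction: Direction,
  allowedRegressionRatio = 0.1,
): boolean {
  return direction === "lower-is-better"
    ? value <= baseline * (1 + allowedRegressionRatio)
    : value >= baseline * (1 - allowedRegressionRatio);
}
```

Per-metric direction means the same evaluator handles a latency budget and a throughput floor without special-casing either.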

### CLI orchestrator — `packages/player/tests/perf/index.ts`

- Parses `--mode` / `--scenarios` / `--runs` / `--fixture` in both space- and equals-separated form (so `--scenarios fps,scrub` and `--scenarios=fps,scrub` both work — matches what humans type and what GitHub Actions emits).
- Runs scenarios, runs the gate, and **always** writes `results/metrics.json` with schema version, git SHA, metrics, and gate rows — so failed runs are still investigable from the artifact alone.
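
Accepting both space- and equals-separated flags is a small amount of code; a minimal sketch (getFlag is an illustrative name, not the orchestrator's actual helper):

```typescript
// Returns the value of --name from argv, accepting both
// `--name value` and `--name=value` forms.
function getFlag(argv: string[], name: string): string | undefined {
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    if (arg === `--${name}`) return argv[i + 1];
    if (arg.startsWith(`--${name}=`)) return arg.slice(name.length + 3); // skip "--name="
  }
  return undefined;
}
```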

### Fixture + smoke scenario

- `fixtures/gsap-heavy/index.html`: 200 stagger-animated tiles, no media. Heavy enough to make load time meaningful, light enough to be deterministic.
- `scenarios/03-load.ts`: cold + warm composition load. Measures from navigation start to player `ready` event, reports p95 across runs.

### CI — `.github/workflows/player-perf.yml`

- `paths-filter` on `player` / `core` / `runtime` — perf only runs when something that could move the needle actually changed.
- Sets up bun + node + chrome, runs perf in `measure` mode on a shard matrix (so future scenarios shard naturally), uploads `metrics.json` artifacts, and a summary job aggregates shard results into a single PR comment.

### Wiring

- `packages/player`: `puppeteer-core`, `gsap`, `@types/bun` devDeps; typecheck extended to cover the perf `tsconfig`; new `perf` script.
- Root `package.json`: `player:perf` workspace script so `bun run player:perf` runs the whole suite locally with the same flags CI uses.
- `.gitignore`: `packages/player/tests/perf/results/`.
- Separate `tests/perf/tsconfig.json` so test code doesn't pollute the package `rootDir` while still being typechecked.

## Test plan

- [x] Local: `bun run player:perf` passes — cold p95 ≈ 386 ms, warm p95 ≈ 375 ms, both well under the seeded baselines.
- [x] Typecheck, lint, format pass on the perf workspace.
- [x] Existing player unit tests (71/71) still green.
- [ ] First CI run after merge will be the real signal: confirms `setup-chrome` works on hosted runners, the shard matrix wires up, and `metrics.json` artifacts upload.

## Stack

Step `P0-1a` of the player perf proposal. The next two slices are content-only — they don't touch the harness:

- `P0-1b` (#400): adds `02-fps`, `04-scrub`, `05-drift` scenarios on a 10-video-grid fixture.
- `P0-1c` (#401): adds `06-parity` (live playback vs. synchronously-seeked reference, compared via SSIM).

Wiring this gate up first means each follow-up is a self-contained scenario file + baseline row + workflow shard.
@vanceingalls vanceingalls force-pushed the perf/p0-1b-perf-tests-for-fps-scrub-drift branch 2 times, most recently from 33150d2 to b9fd169 Compare April 23, 2026 01:04
@vanceingalls vanceingalls force-pushed the perf/p0-1c-live-playback-parity-test branch from ec31023 to 400409b Compare April 23, 2026 01:04
@vanceingalls vanceingalls changed the base branch from perf/p0-1b-perf-tests-for-fps-scrub-drift to graphite-base/401 April 23, 2026 01:10
vanceingalls added a commit that referenced this pull request Apr 23, 2026
… drift (#400)

## Summary

Second slice of `P0-1` from the player perf proposal: plugs the three steady-state scenarios — sustained playback FPS, scrub latency, and media-sync drift — into the perf gate that landed in #399. Adds the multi-video fixture they all share, wires three new shards into CI, and seeds one new baseline (`droppedFramesMax`).

## Why

#399 stood up the harness and proved it with a single load-time scenario. By itself that's enough to catch regressions in initial composition setup, but it can't catch the things players actually fail at in production:

- **FPS regressions** — a render-loop change that drops the ticker from 60 to 45 fps still loads fast.
- **Scrub latency regressions** — the inline-vs-isolated split (#397) is exactly the kind of code path where a refactor can silently push everyone back to the postMessage round trip.
- **Media drift** — runtime mirror logic (#396 in this stack) and per-frame scheduling tweaks can both cause video to slip out of sync with the composition clock without producing a single console error.

Each of these is a target metric in the proposal with a concrete budget. This PR turns those budgets into gated CI signals and produces continuous data for them on every player/core/runtime change.

## What changed

### Fixture — `packages/player/tests/perf/fixtures/10-video-grid/`

- `index.html`: 10-second composition, 1920×1080, 30 fps, with 10 simultaneously-decoding video tiles in a 5×2 grid plus a subtle GSAP scale "breath" on each tile (so the rAF/RVFC loops have real work to do without GSAP dominating the budget the decoder needs).
- `sample.mp4`: small (~190 KB) clip checked in so the fixture is hermetic — no external CDN dependency, identical bytes on every run.
- Same `data-composition-id="main"` host pattern as `gsap-heavy`, so the existing harness loader works without changes.

### `02-fps.ts` — sustained playback frame rate

- Loads `10-video-grid`, calls `player.play()`, samples `requestAnimationFrame` callbacks inside the iframe for 5 s.
- Crucial sequencing: install the rAF sampler **before** `play()`, wait for `__player.isPlaying() === true`, **then reset the sample buffer** — otherwise the postMessage round-trip ramp-up window drags the average down by 5–10 fps.
- FPS = `(samples − 1) / (lastTs − firstTs in s)`; uses rAF timestamps (the same ones the compositor saw) rather than wall-clock `setTimeout`, so we're measuring real frame production.
- Dropped-frame definition matches Chrome DevTools: gap > 1.5× (1000/60 ms) ≈ 25 ms = "missed at least one vsync."
- Aggregation across runs: `min(fps)` and `max(droppedFrames)` — worst case wins, since the proposal asserts a floor on fps and a ceiling on drops.
- Emits `playback_fps_min` (higher-is-better, baseline `fpsMin = 55`) and `playback_dropped_frames_max` (lower-is-better, baseline `droppedFramesMax = 3`).

### `04-scrub.ts` — scrub latency, inline + isolated

- Loads `10-video-grid`, pauses, then issues 10 seek calls in two batches: first the synchronous **inline** path (`<hyperframes-player>`'s default same-origin `_trySyncSeek`), then the **isolated** path (forced by replacing `_trySyncSeek` with `() => false`, which makes the player fall back to the postMessage `_sendControl("seek")` bridge that cross-origin embeds and pre-#397 builds use).
- Inline runs first so the isolated mode's monkey-patch can't bleed back into the inline samples.
- Detection: a rAF watcher inside the iframe polls `__player.getTime()` until it's within `MATCH_TOLERANCE_S = 0.05 s` of the requested target. Tolerance exists because the postMessage bridge converts seconds → frame number → seconds, and that round-trip can introduce sub-frame quantization drift even for targets on the canonical fps grid.
- Timing: `performance.timeOrigin + performance.now()` in both contexts. `timeOrigin` is consistent across same-process frames, so `t1 − t0` is a true wall-clock latency, not a host-only or iframe-only stopwatch.
- Targets alternate forward/backward (`1.0, 7.0, 2.0, 8.0, 3.0, 9.0, 4.0, 6.0, 5.0, 0.5`) so no two consecutive seeks land near each other — protects the rAF watcher from matching against a stale `getTime()` value before the seek command is processed.
- Aggregation: `percentile(95)` across the pooled per-seek latencies from every run. With 10 seeks × 2 modes × 3 runs we get 30 samples per mode per CI shard, enough for a stable p95.
- Emits `scrub_latency_p95_inline_ms` (lower-is-better, baseline `scrubLatencyP95InlineMs = 33`) and `scrub_latency_p95_isolated_ms` (lower-is-better, baseline `scrubLatencyP95IsolatedMs = 80`).

### `05-drift.ts` — media sync drift

- Loads `10-video-grid`, plays 6 s, instruments **every** `video[data-start]` element with `requestVideoFrameCallback`. Each callback records `(compositionTime, actualMediaTime)` plus a snapshot of the clip transform (`clipStart`, `clipMediaStart`, `clipPlaybackRate`).
- Drift = `|actualMediaTime − ((compTime − clipStart) × clipPlaybackRate + clipMediaStart)|` — the same transform the runtime applies in `packages/core/src/runtime/media.ts`, snapshotted once at sampler install so the per-frame work is just subtract + multiply + abs.
- Sustain window is 6 s (not the proposal's 10 s) because the fixture composition is exactly 10 s long and we want headroom before the end-of-timeline pause/clamp behavior. With 10 videos × ~25 fps × 6 s we still pool ~1500 samples per run — more than enough for a stable p95.
- Same "reset buffer after play confirmed" gotcha as `02-fps.ts`: frames captured during the postMessage round-trip would compare a non-zero `mediaTime` against `getTime() === 0` and inflate drift by hundreds of ms.
- Aggregation: `max()` and `percentile(95)` across the pooled per-frame drifts. The proposal's max-drift ceiling of 500 ms is intentional — the runtime hard-resyncs when `|currentTime − relTime| > 0.5 s`, so a regression past 500 ms means the corrective resync kicked in and the viewer saw a jump.
- Emits `media_drift_max_ms` (lower-is-better, baseline `driftMaxMs = 500`) and `media_drift_p95_ms` (lower-is-better, baseline `driftP95Ms = 100`).

### Wiring

- `packages/player/tests/perf/index.ts`: add `fps`, `scrub`, `drift` to `ScenarioId`, `DEFAULT_RUNS`, the default scenario list (`--scenarios` defaults to all four), and three new dispatch branches.
- `packages/player/tests/perf/perf-gate.ts`: add `droppedFramesMax: number` to `PerfBaseline`. Other baseline keys for these scenarios were already seeded in #399.
- `packages/player/tests/perf/baseline.json`: add `droppedFramesMax: 3`.
- `.github/workflows/player-perf.yml`: three new matrix shards (`fps` / `scrub` / `drift`) at `runs: 3`. Same `paths-filter` and same artifact-upload pattern as the `load` shard, so the summary job aggregates them automatically.

## Methodology highlights

These three patterns recur in all three scenarios and are worth noting because they're load-bearing for the numbers we report:

1. **Reset buffer after play-confirmed.** The `play()` API is async (postMessage), so any samples captured before `__player.isPlaying() === true` belong to ramp-up, not steady-state. Both `02-fps` and `05-drift` clear `__perfRafSamples` / `__perfDriftSamples` *after* the wait. Without this, fps drops 5–10 and drift inflates by hundreds of ms.
2. **Iframe-side timing.** All three scenarios time inside the iframe (`performance.timeOrigin + performance.now()` for scrub, rAF/RVFC timestamps for fps/drift) rather than host-side. The iframe is what the user sees; host-side timing would conflate Puppeteer's IPC overhead with real player latency.
3. **Stop sampling before pause.** Sampler is deactivated *before* `pause()` is issued, so the pause command's postMessage round-trip can't perturb the tail of the measurement window.

## Test plan

- [x] Local: `bun run player:perf` runs all four scenarios end-to-end on the 10-video-grid fixture.
- [x] Each scenario produces metrics matching its declared `baselineKey` so `perf-gate.ts` can find them.
- [x] Typecheck, lint, format pass on the new files.
- [x] Existing player unit tests untouched (no production code changes in this PR).
- [ ] First CI run will confirm the new shards complete inside the workflow timeout and that the summary job picks up their `metrics.json` artifacts.

## Stack

Step `P0-1b` of the player perf proposal. Builds on:

- `P0-1a` (#399): the harness, runner, gate, and CI workflow this PR plugs new scenarios into.

Followed by:

- `P0-1c` (#401): `06-parity` — live playback frame vs. synchronously-seeked reference frame, compared via SSIM, on the existing `gsap-heavy` fixture from #399.
@vanceingalls vanceingalls force-pushed the perf/p0-1c-live-playback-parity-test branch from 400409b to ac77a6c Compare April 23, 2026 01:10
@graphite-app graphite-app Bot changed the base branch from graphite-base/401 to main April 23, 2026 01:10
Adds the 06-parity scenario, which compares a live-playback frame at
~5s on the gsap-heavy fixture against a synchronously-seeked reference
at the same captured timestamp. The two PNG screenshots are diffed via
ffmpeg's SSIM filter; the run reports parity_ssim_min across runs as a
higher-is-better metric (baseline 0.95, allowed regression ratio 0.1
yields effective gate >= 0.855).

The iframe-side rAF watcher pauses the timeline in the same tick that
getTime() crosses 5.0s so the frozen value can be used as the
canonical T_actual for both captures. After two host-side rAF ticks
for paint settlement the actual frame is screenshotted, then el.seek()
(which routes through _trySyncSeek for same-origin iframes) lands the
player on the same time and a second screenshot is taken as the
reference. Per-run PNGs and the captured time are persisted under
results/parity/run-N/ for CI artifact upload and local debugging.

Wires the scenario into index.ts (ScenarioId, dispatcher, DEFAULT_RUNS
= 3), adds a parity shard to player-perf.yml, and adds paritySsimMin
to baseline.json + the PerfBaseline type so the gate can evaluate it.
@vanceingalls vanceingalls force-pushed the perf/p0-1c-live-playback-parity-test branch from ac77a6c to f606f0d Compare April 23, 2026 01:10
// ffmpeg's lavfi parser uses '\:' to escape the path separator inside
// a filter argument. We don't expect ':' in `statsPath` but escape
// defensively to keep this robust on weird mounts.
`ssim=stats_file=${statsPath.replace(/:/g, "\\:")}`,
@vanceingalls vanceingalls merged commit 80e7cd2 into main Apr 23, 2026
26 of 27 checks passed
